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utline 


■  Short  vector  extensions 

■  Digital  signal  processing  (DSP)  transforms 

■  SPIRAL 

■  Vectorization  of  SPL  formulas 

■  Experimental  results 
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SPIRAL 


IMD  Short  Vector  Extensions 


vector  length  =  4 
(4-way) 


] 


I 


] 


■  Extension  to  instruction  set  architecture 

■  Available  on  most  current  architectures 

■  Originally  for  multimedia  (like  MMX  for  integers) 

■  Requires  fine  grain  parallelism 

■  Large  potential  speed-up 


Name 

n-way 

Precision 

Processors 

SSE 

4-way 

float 

Intel  Pentium  III  and  4,  AMD  AthlonXP 

SSE2 

2 -way 

double 

Intel  Pentium  4 

3DNow! 

2 -way 

float 

AMD  K6,  K7,  AthlonXP 

AltiVec 

4-way 

float 

Motorola  G4 

IPF 

2 -way 

Float 

Intel  Itanium,  Itanium  2 
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roblems 

■  SIMD  instructions  are  architecture  specific 

■  No  common  API  (usually  assembly  hand  coding) 

■  Performance  very  sensitive  to  memory  access 

■  Automatic  vectorization  (by  compilers)  very  limited 

Requires  expert  programmers 


Our  Goal:  Automation  for  digital  signal 

processing  (DSP)  transforms 
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SP  (digital  signal  processing)  transforms 

sampled  signal  (a  vector)  ^.transform  (a  matrix) 

X  I — ^  Atx 

ixample:  Discrete  Fourier  Transform  (DFT)  size  4 
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DFT4  =  ( DFT2  ®  I2)D{I2  ®  DFT2)P 

■  Fast  algorithm  =  product  of  structured  sparse  matrices 

■  Represented  as  formula  using  few  constructs  (e.g.,  ®)  and 
primitives  (diagonal,  permutation) 

■  Captures  a  large  class  of  transforms  (DFT,  DCT,  wavelets, 
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ensor  (Kronecker)  Product  of  Matrices 


A 


coarse 
structure 


b  -  iakiB]k,i 

structure 


for  A  -  [akl  ]kl 


Examples: 
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key  construct  in  many  DSP  transform 
algorithms  (DFT,  WHT,  all  multidimensional) 
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Observation: 

•  For  a  given  transform  there  are  maaaany  different  algorithms 
(equal  in  arithmetic  cost,  differ  in  data  flow) 

•  The  best  algorithm  and  its  implementation  is  platform-dependent 

•  It  is  not  clear  what  the  best  algorithm/implementation  is 


SPIRAL: 


Automatic  algorithm  generation 
+  Automatic  translation  into  code 
+  Intelligent  search  for  “best” 


=  generated  platform-adapted  implementation 
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PIRAL’s  Mathematical  Framework 


Transform  DFTn  parameterized  matrix 


Rule 


DFT  {DFT  ®  I W)D{I„®  DFTm ) •  P 

•  a  breakdown  strategy 

•  product  of  sparse  matrices 


Formula  DFT,  6  =  {DFT,  ®  /.)•  t!6  ■  (I,  ®  DFT  A  L 


16 


16  -  y1^1  ^  4  ^  M/  4  V  4  4  /  4 

by  recursive  application  of  rules 
few  constructs  and  primitives 
can  be  translated  into  code 


Used  as  mathematical  high-level 
representation  of  algorithms 
(SPL  =  signal  processing  language) 
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SPIRAL 


PIRAL  system 


DSP  transform  (user  specified) 


SPIRAL 


Formula  Generator  ^ 


fast  algorithm 
as  SPL  formula 


controls 

algorithm  generation 


controls 

ementation  options 


runtime  on  given  platform 


a> 
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platform -adapted 
implementation 
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Our  Goal:  extend  SPL  compiler 
to  generate  vector  code 


SPIRAL 


enerating  SIMD  Code  from  SPL  Formulas 


Example: 

y  :=  (A®  I4)x 

/ 

vector  length 


naturally  represents 
vector  operation 


(Current)  generic  construct  completely  vectorizable: 


Pp  Qi  permutations 
DP  Et  diagonals 
A i  arbitrary  formulas 
f  SIMD  vector  length 


■  Formulas  contain  all  structural  information  for  vectorization 

■  Construct  above  captures  DFT,  WHT,  all  multi-dimensional 
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he  Approach 

■  Use  macro  layer  as  API  to  hide  machine  specifics 

■  Vector  code  generation  in  two  steps 

1.  Symbolic  vectorization  (formula  manipulation) 

2.  Code  generation 
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ymbolic  Vectorization 


SPIRAL 


dft16  =  0 dft4  ®  /4  )•  r416  •  (/4  ®  DFr4  )•  z'46 

I  Formula  manipulation 
\  (automatic  using  manipulation  rules) 

DFT  i6  =  ((/4  ®  L\)-  (DFT  4  ®  /4).  rf). 

((/4  ®  L\  X46  ®  /2)(/4  ®  z*  )•  (dft4  ®  /4).  (/4  ®  4 )) 
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Pattern 

matching 


■  Manipulate  to  match  vectorizable  construct 

■  Separate  vectorizable  parts  and  scalar  parts 
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ormula  Manipulation 


Normalizing  formulas 

(I  ®  L2v  )(I  ®L2v)=  L  I  -I  ©  I 

A®  B=  (A®  I)(I®  B)  I  =1  ®  / 

I  ®  A=Lnv(A®I  )Ln:  PD  =  D'P 

v  v  ^  v  7  n 

Converting  complex  to  real  arithmetic 

A~B=AB 


A=  A®  I2,  A  real 
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If 


ector  Code  Generation 


k 


fuse  with  load/store  operations 

difficult  part 

(easy  to  loose  performance) 


i=l 


Pp  Qi 

permutations 

Dp  Ei 

diagonals 

arbitrary  formulas 

S 

1 

SIMD  vector  length 

arithmetic  vector  instructions 

•  use  standard  SPL  compiler  ox\At 

•  replace  scalar  with  vector  instructions 

easy  part 

(due  to  existing  SPL  compiler) 
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hallenge  I  Data  Access 

Example: 

Required:  Available: 


Strided  load  of  complex  numbers  Vector  load  plus  in  -register  permutations 


■  highest  performance  code  requires  properly  aligned  data  access 
-  permutation  support  differs  between  architectures 

■  performance  differs  between  permutations  (some  are  good,  most  very  bad) 
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■  use  formula  manipulation  to  get  “good”  permutations 
-  macro  layer  API  for  efficient  and  machine  transparent 
implementation 


SPIRAL 


ortable  High-level  API 

■  restricted  set  of  short  vector  operations 

■  requires  C  compiler  with  „intrinsics“-interface 

■  high-level  operations 

■  Vector  arithmetic  operations 

■  Vector  load/store  operations 

■  Special  and  arbitrary  multi-vector  permutations 

■  Vector  constant  handling  (declaration,  usage) 

■  Implemented  by  C  macros 

Example: 

Unit-stride  load  of  4  complex  numbers: 


LOAD  L  8  2(regl,  reg2,  *mem) 
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ortable  SIMD  API:  Details 


SPIRAL 


All  SIMD  extensions  supported: 

■  gcc  3.0,  gcc-vec 

■  Intel  C++  Compiler,  MS  VisualC++  with  ProcessorPack 

■  Various  PowerPC  compilers  (Motorola  standard) 


Examples: 

Reverse  load  of  4  real  numbers: 
LOAD  J  4(reg,  *mem) 


Reverse  load  of  4  complex  numbers: 
LOAD  J  4  x  I  2(rl,  r2,  *mem) 
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enerated  code 


■  Vector  parts:  portable  SIMD  API 

■  Scalar  parts:  standard  C 

■  Pt,  Q,,  Z),  Et  handled  by  load/store  operations 

■  A,  handled  by  vector  arithmetics 


/*  Example  vector  code:  DFT_16  */ 
void  DFT_16 (vector_f loat  *y, 
vector_float  *x) 

{ 

vector_f loat  xlO ,  xll ,  xl2 ; 

LOAD_VECT (xlO ,  x  +  0); 

LOAD_VECT (xl4 ,  x  +  16); 
fO  =  SIMD_SUB (xlO ,  xl4) ; 
LOAD_VECT (xll ,  x  +  4); 

LOAD_VECT (xl5 ,  x  +  20); 
fl  =  SIMD_SUB (xll ,  xl5) ; 

yl7  =  SIMD_SUB ( f 1 ,  f4) ; 

STORE_L_8_4 (yl6,  yl7 ,  y  +  24); 

yl2  =  SIMD_SUB (fO ,  f5) ; 

yl3  =  SIMD_ADD (f 1 ,  f4) ; 
STORE_L_8_4 (yl2 ,  yl3 ,  y  +  8) ; 
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/*  Intel  SSE:  portable  SIMD  API 
Intel  C++  Compiler  5 . 0 

*/ 

typedef  _ ml28  vector_f loat ; 

#def ine  LOAD_VECT ( a ,  b)  \ 

(a)  =  *  (b) 

#define  SIMD_ADD (a ,  b)  \ 

_mm_add_ps ( (a) ,  (b) ) 

#def ine  SIMD_SUB(a,  b)  \ 

_mm_sub_ps ( (a) ,  (b) ) 

#define  STORE_L_8_4 (re ,  im,  out)  \ 

{  \ 

vector_float  _sttmpl ,_sttmp2 ;  \ 

_sttmpl  =  _mm_unpacklo_ps (re ,  im) ;  \ 

_sttmp2  =  _mm_unpackhi  j>s  (re ,  im)  ;  \ 

_mm_store_ps (out ,  _sttmpl) ;  \ 

_mm_store_ps ( (out)  +  VLEN,  _sttmp2) ;  \ 

} 


xperimental  Results 


■  our  code  is  generated,  found  by  dynamic  programming  search 

■  different  searches  for  different  types  of  code  (scalar,  vector) 

■  results  in  (Pseudo)  gigaflops  (higher  =  better) 
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(Pseudo)  gflops 


enerated  DFT  Code:  Pentium  4, 


DFT  2n  single  precision,  Pentium  4,  2.53  GHz,  using  Intel  C  compiler  6.0 


SPIRAL 


speedups  (to  C  code)  up  to  factor  of  3.1 
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gflops 


enerated  DFT  Code:  Pentium  4,  SSE2 


Intel  MKL  split 
Spiral  SSE2 


FFTW  2.1.3 


*  Spiral  C 

*  Spiral  C  vect 

-  ■■  -  Spiral  SSE2  exh. 


DFT  2n  double  precision,  Pentium  4,  2.53  GHz,  using  Intel  C  compiler  6.0 
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speedups  (to  C  code)  up  to  factor  of  1.8 


gflops 


enerated  DFT  Code:  Pentium  III,  SSE 


Intel  MKL 
Spiral  SSE 
Spiral  C  vect 
Spiral  C 
FFTW  2.1.3 


SPIRAL. 


DFT  2"  single  precision,  Pentium  III,  1  GHz,  using  Intel  C  compiler  6.0 

speedups  (to  C  code)  up  to  factor  of  2.1 
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gflops 


enerated  DFT  Code:  Athlon  XP,  SSE 


-♦-Spiral  SSE 
Intel  MKL  P3 
-*-FFTW  2.1.3 
Intel  MKL  P4 
-■-Spiral 
-■-Spiral  C  vect 
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DFT  2"  single  precision,  Pentium  III,  1  GHz,  using  Intel  C  compiler  6.0 


speedups  (to  C  code)  up  to  factor  of  1.6 
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gflops 


>ther  transforms 


n 


2-dim  DCT  2nx2n 

Pentium  4,  2.53  GHz,  SSE 

4.5 

4 

3.5 

3 

-■-Spiral  float  25 
-■-Spiral  C  vect 
I  Spiral  C  SSE  ^ 

1.5 

1 

0.5 


7  8  9  10  11 

12  13  14 

4x4 

8x8 

16x16  32x32  64x64 

128x128 

transform  size 

transform  size 

-  WHT  has  only  additions 
■  very  simple  transform 

speedups  (to  C  code)  up  to  factor  of  3 

SPIRAL 


WHT  2 

Pentium  4,  2.53  GHz,  SSE 
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ifferent  search  strategies 


DFT  2n  single  precision,  Pentium  4,  2.53  GHz,  using  Intel  C  compiler  6.0 
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standard  DP  looses  up  to  25  %  performance 
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est  DFT  Trees,  size  210  =  1024 


Pentium  4 
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trees  platform/datatype  dependent 


rosstiming  of  best  trees  on  Pentium  4 


-Pentium  4  SSE 
Pentium  4  SSE2 
AthlonXP  SSE 
Pentiumlll  SSE 
■Pentium  4  float 


SPIRAL. 


DFT  2"  single  precision,  runtime  of  best  found  of  other  platforms 


software  adaptation  is  necessary 
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ummary 

■  Automatically  generated  vectorized  DSP  code 

■  Code  is  platform-adapted  (SPIRAL) 

■  We  implement  “constructs”,  not  transforms 

■  tensor  product,  permutations, ... 

■  DFT,  WHT,  arbitrary  multi-dim  supported 

■  Very  competitive  performance 
Ongoing  work: 

■  port  to  other  SIMD  architectures 

■  include  filters  and  wavelets 
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